GitHub Repository Link
Background and Motivation
Reddit, known as the “Front Page of the Internet”, is a popular forum, especially among young people, where users can post about almost anything. Unlike on other social media platforms, the majority of Reddit users remain anonymous. We believe that the anonymity of the forum makes its posts well suited for training and testing NLP models. Reddit has a large international community and a lot of programming-related content. We want to use data from Reddit to better understand the popularity of programming languages among its users.
Additionally, we want to compare it to data from Stack Overflow, a forum that focuses on solving programming-related problems.
We want to evaluate which programming languages are discussed on both forums and compare them. To reach this aim, we want to use visualization and machine learning methods based on text data, and also use the quantitative signals we get from the upvotes and the number of comments.
Research Questions
- Can we decide which topic/language a certain post is about?
- How do the number of upvotes, the number of comments and the number of posts correlate with popularity?
- How does the popularity of programming languages change over time?
- Can we predict the popularity of programming languages in the future?
- How do the two platforms compare with respect to the programming languages discussed?
Design overview
For the Reddit posts, the plan is to use a Reddit API to get data sets for a certain time range and a number of specific subreddits. The choice of subreddits is crucial for the quality and expressiveness of our data and will be based on prior research into programming-related subreddits. From this data we can then get the subreddit, title, text, upvotes and various metadata.
Data set example
| data.subreddit | data.title | data.id | data.created | data.created_utc | data.upvote_ratio | data.ups | data.score | data.num_comments |
|---|---|---|---|---|---|---|---|---|
| coding | Back-End VS Front-End Framework \| 6 J.S. Frameworks Experts Love - Untied Blogs | nh0yzf | 1621547972 | 1621519172 | 0.25 | 0 | 0 | 1 |
| coding | File Descriptor Limits | ngzeep | 1621543958 | 1621515158 | 0.50 | 0 | 0 | 0 |
| coding | Introduction to Continuous Profiling | ngy73c | 1621540423 | 1621511623 | 0.82 | 21 | 21 | 3 |
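The `data.` column prefixes suggest that the rows above were flattened from a nested JSON response. A minimal sketch of that flattening step is shown here, assuming a Reddit listing in which each post sits under `data.children[].data`. The `listing` dictionary is a hand-built excerpt using values from the last example row, and `flatten_listing`, `FIELDS`, `created_iso` and `month` are hypothetical names introduced only for illustration. In the example rows, `data.created` is 28800 seconds (8 hours) ahead of `data.created_utc`, so the sketch relies on `created_utc` only.

```python
from datetime import datetime, timezone

# Hand-built excerpt of a listing response (an assumption about the raw
# format); the values are copied from the last row of the example table.
listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {
                "kind": "t3",
                "data": {
                    "subreddit": "coding",
                    "title": "Introduction to Continuous Profiling",
                    "id": "ngy73c",
                    "created_utc": 1621511623,
                    "upvote_ratio": 0.82,
                    "ups": 21,
                    "score": 21,
                    "num_comments": 3,
                },
            }
        ]
    },
}

# Hypothetical selection of fields to keep for the analysis.
FIELDS = ["subreddit", "title", "id", "created_utc",
          "upvote_ratio", "ups", "score", "num_comments"]


def flatten_listing(listing, fields=FIELDS):
    """Turn a listing into one flat dict per post, keeping only `fields`."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        row = {name: post.get(name) for name in fields}
        # created_utc is a Unix timestamp in seconds; derive a readable UTC
        # time and a month bucket for the later trend analysis.
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        row["created_iso"] = created.isoformat()
        row["month"] = created.strftime("%Y-%m")
        rows.append(row)
    return rows


rows = flatten_listing(listing)
print(rows[0]["id"], rows[0]["created_iso"], rows[0]["month"])
```

Bucketing by month at this stage keeps the later time series steps simple, since posts can be counted per language and month without parsing timestamps again.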
Data Preprocessing
- Stack Overflow data dumps: https://archive.org/download/stackexchange
  - Using the different Stack Overflow dumps (posts, tags, users, votes, post history, comments, etc.)
  - Merging these separate data dumps to extract the relevant data and make it comparable with the Reddit data before using it for our analysis and visualization
- Reddit data:
  - Fetching the Reddit data from the Pushshift API
  - Finding relevant subreddits
  - Selecting the features that should be used
- Text preprocessing:
  - Removing punctuation and stopwords
  - Stemming
  - Vectorization
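The text preprocessing steps listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `STOPWORDS` is a tiny made-up list, `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, and the vectorization is a plain bag-of-words count. None of these choices are fixed by the project plan.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a full one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "for"}
# Crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOPWORDS]


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vectorize(docs):
    """Return (vocabulary, count vectors) for a list of raw texts."""
    stemmed = [[crude_stem(w) for w in tokenize(d)] for d in docs]
    vocab = sorted({w for doc in stemmed for w in doc})
    vectors = []
    for doc in stemmed:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


docs = ["Parsing JSON files in Python!", "Python is parsing the file."]
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors)
```

One caveat for this particular project: removing all punctuation also destroys language names such as C++ or C#, so those tokens would probably need to be protected or mapped to placeholders before this step.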
Exploratory Data Analysis and Visualization
- Exploratory data analysis using box plots, histograms, scatter plots, etc. on upvotes, number of comments and number of posts for every programming language
- Showing the trend with a line plot and confidence interval
- Using a 2D embedding of the posts in a scatter plot to show similarity
- Comparing Stack Overflow and Reddit in the plots
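Before plotting, the per-language numbers behind the box plots and confidence bands can be computed directly. The following is a minimal sketch, assuming posts have already been labeled with a language. The `posts` values are made up, `summarize` is a hypothetical helper, and the interval uses a normal approximation (1.96 standard errors), which is only a rough guide for groups this small.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical posts already labeled with a language; values are made up.
posts = [
    {"language": "python", "ups": 21, "num_comments": 3},
    {"language": "python", "ups": 5, "num_comments": 1},
    {"language": "python", "ups": 10, "num_comments": 8},
    {"language": "rust", "ups": 40, "num_comments": 12},
    {"language": "rust", "ups": 20, "num_comments": 4},
]


def summarize(posts, field):
    """Per-language count, median, mean and an approximate 95% interval."""
    groups = defaultdict(list)
    for post in posts:
        groups[post["language"]].append(post[field])
    summary = {}
    for language, values in groups.items():
        m = mean(values)
        # The sample standard deviation needs at least two observations.
        if len(values) > 1:
            half_width = 1.96 * stdev(values) / sqrt(len(values))
        else:
            half_width = 0.0
        summary[language] = {
            "n": len(values),
            "median": median(values),
            "mean": m,
            "ci95": (m - half_width, m + half_width),
        }
    return summary


result = summarize(posts, "ups")
for language, stats in result.items():
    print(language, stats)
```

The same function can be called with `"num_comments"` to compare both engagement signals per language.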
Classification of topics
- Topic modeling using LDA (Latent Dirichlet Allocation)
- Using decision trees or ensemble methods
- Using sentiment analysis and clustering to evaluate if a post is a professional question or an opinion
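To make the decision tree idea concrete, here is a minimal sketch of a depth-one tree (a decision stump) that splits on the presence of a single word, chosen by weighted Gini impurity. The training titles and the labels "question" and "opinion" are invented for illustration, and `gini`, `fit_stump` and `predict` are hypothetical helpers. A real model would use a full tree or an ensemble on properly vectorized text.

```python
from collections import Counter

# Hypothetical labeled titles; texts and labels are made up for illustration.
train = [
    ("how do i fix this null pointer exception", "question"),
    ("why does my loop never terminate", "question"),
    ("how to parse json in go", "question"),
    ("rust is the best language ever", "opinion"),
    ("i think python is overrated", "opinion"),
    ("java is still the best choice", "opinion"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(samples):
    """Pick the word whose presence/absence split minimizes weighted Gini."""
    docs = [(set(text.split()), label) for text, label in samples]
    vocab = sorted({w for words, _ in docs for w in words})
    best = None
    for word in vocab:
        present = [label for words, label in docs if word in words]
        absent = [label for words, label in docs if word not in words]
        if not present or not absent:
            continue  # a word in every or no document cannot split the data
        score = (len(present) * gini(present)
                 + len(absent) * gini(absent)) / len(docs)
        if best is None or score < best[0]:
            best = (score, word, majority(present), majority(absent))
    if best is None:
        raise ValueError("no usable split found")
    return best[1:]


def predict(stump, text):
    """Classify a text by checking whether the split word occurs in it."""
    word, if_present, if_absent = stump
    return if_present if word in text.split() else if_absent


stump = fit_stump(train)
print(stump)
print(predict(stump, "kotlin is nicer than java"))
```

On toy data like this, the stump latches onto an incidental word, which illustrates why a single shallow tree is fragile and why ensembles trained on more data are the more realistic option.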
Prediction of future popularity
- ARIMA
- SARIMA
- Exponential Smoothing
- Random Forest (based on CART)
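Of the methods listed above, exponential smoothing is the simplest to write out. The sketch below implements simple exponential smoothing, where each level is `alpha * x_t + (1 - alpha) * s_{t-1}` and the forecast is flat at the last level. The series `monthly_posts` and the choice `alpha=0.5` are made-up illustration values, not project results.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed levels s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]  # initialize the level with the first observation
    for value in series[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels


def forecast(series, alpha, horizon):
    """Simple exponential smoothing forecasts a constant: the last level."""
    last_level = exponential_smoothing(series, alpha)[-1]
    return [last_level] * horizon


# Hypothetical monthly post counts for one programming language.
monthly_posts = [100, 120, 110, 130]
print(exponential_smoothing(monthly_posts, alpha=0.5))
print(forecast(monthly_posts, alpha=0.5, horizon=3))
```

This variant models neither trend nor seasonality, so it mainly serves as a baseline. If post counts show a yearly pattern, SARIMA or a seasonal smoothing variant would be the more suitable candidates from the list.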
Shiny application
- Using Shiny as a tool for interactive exploration of the data set
- Sliders for selecting a time range of interest
- Checkboxes or dropdown menus to select different programming languages
- Bonus: Shiny offers the opportunity to refresh the data sets on a web page, making it possible to get new insights every day
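Independent of the UI framework, the slider and checkbox inputs boil down to one filtering step on the data. A minimal sketch of that logic follows. `filter_posts` and the sample rows are hypothetical, and in the application the `start`, `end` and `languages` values would come from the Shiny inputs rather than being hard-coded.

```python
from datetime import date


def filter_posts(posts, start, end, languages):
    """Keep posts dated within [start, end] whose language is selected."""
    selected = set(languages)
    return [
        post
        for post in posts
        if start <= post["date"] <= end and post["language"] in selected
    ]


# Hypothetical rows; a slider would supply start/end and a checkbox group
# would supply the list of selected languages.
posts = [
    {"date": date(2021, 3, 1), "language": "python", "ups": 12},
    {"date": date(2021, 5, 20), "language": "rust", "ups": 21},
    {"date": date(2021, 6, 2), "language": "python", "ups": 7},
]
visible = filter_posts(posts, date(2021, 5, 1), date(2021, 6, 30), ["python"])
print(visible)
```

Keeping this step as a pure function makes it easy to test outside the web application and to reuse for both the Reddit and the Stack Overflow data.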
